[SPARK-56910][SQL] Simplify Cast to byte/short codegen under ANSI mode#55935
[SPARK-56910][SQL] Simplify Cast to byte/short codegen under ANSI mode#55935gengliangwang wants to merge 1 commit into
Conversation
c7b2229 to
1c34f53
Compare
1c34f53 to
2c1ab79
Compare
|
Audited this PR for the same lessons surfaced by @viirya and @cloud-fan on #55938 (and applied to #55934 / #55939):
So no changes needed on this PR for that review. |
Extend `CastUtils.java` with helpers for `byte` and `short` ANSI cast
targets and use them from `Cast.scala`. Drops the byte/short-target
dispatch (and the now-unused `lowerAndUpperBound` Scala helper) added
in SPARK-56909 -- after this PR, all integral and fractional narrowing
ANSI casts share the same `CastUtils.<...>Exact` one-line codegen.
Helpers added:
* `shortToByteExact(short)`, `intToByteExact(int)`, `longToByteExact(long)`
* `intToShortExact(int)`, `longToShortExact(long)`
* `floatToByteExact(float)`, `doubleToByteExact(double)`
* `floatToShortExact(float)`, `doubleToShortExact(double)`
`Cast.scala` changes:
* `castIntegralTypeToIntegralTypeExactCode` / `castFractionToIntegralTypeCode`
no longer dispatch on target type -- the helper-name pattern
`${integralPrefix(from)}To${target.capitalize}Exact` covers all four
target types.
* Eval paths for `castToByte` and `castToShort` add ANSI cases for
`ShortType` / `IntegerType` / `LongType` / `FloatType` / `DoubleType`
source types that delegate to the new helpers; the existing
`exactNumeric.toInt(b) + bounds-check` fallback now only handles the
remaining `Decimal` source.
Part of SPARK-56908 (umbrella). The original byte/short ANSI cast bodies
were 5 lines each across 8 call sites; this PR collapses them to one
line per call site, matching the int/long target work from SPARK-56909.
No. The compiled behavior is identical; only the emitted Java source
text changes.
```
build/sbt "catalyst/testOnly *CastSuite *CastWithAnsiOnSuite \
*CastWithAnsiOffSuite *AnsiCastSuite *TryCastSuite \
*ExpressionClassIdentitySuite"
```
312/312 pass.
Generated-by: Cursor 1.x
2c1ab79 to
c348b68
Compare
cloud-fan
left a comment
There was a problem hiding this comment.
Summary
Prior state and problem: Cast.scala had two helpers (castIntegralTypeToIntegralTypeExactCode, castFractionToIntegralTypeCode) that emitted multi-line Java bodies for ANSI byte/short narrowing — an inline if (overflow) throw for integral sources and a 5-line floor/ceil bounds-check block for fractional sources, across 8 codegen call sites. #55934 already collapsed the int/long target case by routing through existing *ExactNumeric Scala objects, but those objects don't expose cross-type narrowing to byte/short (their toByte(byte) / toShort(short) are same-type identities), so the byte/short case still had the inline blocks.
Design approach: Introduce a CastUtils.java helper class holding 9 single-line static narrowing methods. Each method holds source/target DataType references in private static final fields, eliminating per-row references[] lookups vs. the old codegen path. Both byte/short branches in the two codegen helpers, and the corresponding eval-path ANSI cases (castToByte / castToShort), route through the new helpers. A Java class rather than a Scala object avoids the getClass.getCanonicalName.stripSuffix("$") boilerplate that Scala-object access from codegen requires.
Key design decisions:
- Java helper class rather than extending
*ExactNumeric(which would force public API changes on those objects for a narrow internal use case). - Static
DataTypefields rather than passingfrom/tothrough codegenreferences[](small per-row win). - Per-source-type case in eval paths (
case IntegerType if ansiEnabled => ...) rather than continuing to flow through the genericcase x: NumericType if ansiEnabledblock; Decimal still falls through.
Implementation sketch:
CastUtils.java: 9 static helpers with single-statement bodies, shared error path viaQueryExecutionErrors.castingCauseOverflowError.Cast.scala: byte/short branches emitCastUtils.${prefix}To${Target}Exact($c); two newintegralPrefix/fractionalPrefixprivate helpers mapDataTypeto method-name prefix; the no-longer-usedlowerAndUpperBoundhelper is removed.
Behavior verification: Traced edge cases (NaN, ±Inf, large floats, boundary values 127.5f / -128.5f / 128.0f, integer-but-not-byte values like 200, in-range values) for all source types — old and new paths produce identical results and identical error-message arguments (value, source type, target type). PR description's claim that only emitted Java text changes is verified.
LGTM modulo one optional follow-up below.
| // Byte / short narrowing: call the matching CastUtils helper. Existing *ExactNumeric | ||
| // objects don't expose cross-type narrowing to byte / short (their toByte / toShort are | ||
| // same-type identities), so a Java helper is the cleanest fit. | ||
| val castUtils = classOf[CastUtils].getName |
There was a problem hiding this comment.
nit (optional, fine to defer to a follow-up): now that this else-branch no longer calls ctx.addReferenceObj("from"/"to", ...), both ctx: CodegenContext and to: DataType are unused in castIntegralTypeToIntegralTypeExactCode — and the same is true for castFractionToIntegralTypeCode just below. Since the SPARK-56908 series is about simplification, it would be consistent to drop both params from both helpers (and from the 5+5 = 10 call sites). Up to you whether to do it here or as a tiny follow-up.
|
@cloud-fan Thanks for the review. Merging to master/4.x |
### What changes were proposed in this pull request? Introduce `CastUtils.java` with nine static helpers for ANSI overflow-checked narrowing to `byte` / `short`, and use them from `Cast.scala` (both codegen and eval paths). Helpers added: * `shortToByteExact(short)`, `intToByteExact(int)`, `longToByteExact(long)` * `intToShortExact(int)`, `longToShortExact(long)` * `floatToByteExact(float)`, `doubleToByteExact(double)` * `floatToShortExact(float)`, `doubleToShortExact(double)` `ByteExactNumeric` / `ShortExactNumeric` only expose same-type identity narrowing (their `toByte(byte)` / `toShort(short)` are trivial), so unlike the `int` / `long` targets refactored in #55934 — which delegate to `LongExactNumeric.toInt` / `FloatExactNumeric.toInt` / `DoubleExactNumeric.toInt` / `toLong` — there is no existing Scala object to route the byte/short narrowing through. The Java helper is the cleanest fit. `Cast.scala` changes: * `castIntegralTypeToIntegralTypeExactCode`: the `byte` / `short` branch (previously an inline 5-line if/throw block) emits a single `CastUtils.${integralPrefix(from)}To${target.capitalize}Exact($c)` call. The `int` branch (added in #55934) is unchanged. * `castFractionToIntegralTypeCode`: the `byte` / `short` branch (previously an inline 5-line floor/ceil block plus `lowerAndUpperBound`) emits a single `CastUtils.${fractionalPrefix(from)}To${target.capitalize}Exact($c)` call. The `int` / `long` branch (added in #55934) is unchanged. The now-unused `lowerAndUpperBound` Scala helper is removed. * Eval paths for `castToByte` and `castToShort` add ANSI cases for `ShortType` / `IntegerType` / `LongType` / `FloatType` / `DoubleType` source types that delegate to the new helpers, replacing the existing multi-line `exactNumeric.toInt(b) + bounds-check` body. * Two small `integralPrefix(from: DataType)` / `fractionalPrefix(from: DataType)` Scala helpers handle the method-name dispatch. ### Why are the changes needed? Part of SPARK-56908 (umbrella). The byte/short narrowing ANSI bodies were 5 lines each across 8 codegen call sites; this PR collapses them to one line per call site, matching the int/long target work merged in #55934. ### Does this PR introduce _any_ user-facing change? No. The compiled behavior is identical; only the emitted Java source text changes. ### How was this patch tested? ``` build/sbt "catalyst/testOnly *CastSuite *CastWithAnsiOnSuite *CastWithAnsiOffSuite *AnsiCastSuite *TryCastSuite" ``` 307/307 pass. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Cursor 1.x Closes #55935 from gengliangwang/SPARK-56910-cast-byte-short. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 6165bb0) Signed-off-by: Gengliang Wang <gengliang@apache.org>
What changes were proposed in this pull request?
Introduce
CastUtils.javawith nine static helpers for ANSI overflow-checked narrowing tobyte/short, and use them fromCast.scala(both codegen and eval paths).Helpers added:
shortToByteExact(short),intToByteExact(int),longToByteExact(long)intToShortExact(int),longToShortExact(long)floatToByteExact(float),doubleToByteExact(double)floatToShortExact(float),doubleToShortExact(double)ByteExactNumeric/ShortExactNumericonly expose same-type identity narrowing (theirtoByte(byte)/toShort(short)are trivial), so unlike theint/longtargets refactored in #55934 — which delegate toLongExactNumeric.toInt/FloatExactNumeric.toInt/DoubleExactNumeric.toInt/toLong— there is no existing Scala object to route the byte/short narrowing through. The Java helper is the cleanest fit.Cast.scalachanges:castIntegralTypeToIntegralTypeExactCode: thebyte/shortbranch (previously an inline 5-line if/throw block) emits a singleCastUtils.${integralPrefix(from)}To${target.capitalize}Exact($c)call. Theintbranch (added in [SPARK-56909][SQL] Simplify Cast to int/long codegen under ANSI mode #55934) is unchanged.castFractionToIntegralTypeCode: thebyte/shortbranch (previously an inline 5-line floor/ceil block pluslowerAndUpperBound) emits a singleCastUtils.${fractionalPrefix(from)}To${target.capitalize}Exact($c)call. Theint/longbranch (added in [SPARK-56909][SQL] Simplify Cast to int/long codegen under ANSI mode #55934) is unchanged. The now-unusedlowerAndUpperBoundScala helper is removed.castToByteandcastToShortadd ANSI cases forShortType/IntegerType/LongType/FloatType/DoubleTypesource types that delegate to the new helpers, replacing the existing multi-lineexactNumeric.toInt(b) + bounds-checkbody.integralPrefix(from: DataType)/fractionalPrefix(from: DataType)Scala helpers handle the method-name dispatch.Why are the changes needed?
Part of SPARK-56908 (umbrella). The byte/short narrowing ANSI bodies were 5 lines each across 8 codegen call sites; this PR collapses them to one line per call site, matching the int/long target work merged in #55934.
Does this PR introduce any user-facing change?
No. The compiled behavior is identical; only the emitted Java source text changes.
How was this patch tested?
307/307 pass.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor 1.x